\magnification\magstephalf
\input macro.tex
\def\today{\ifcase\month\or
  January\or February\or March\or April\or May\or June\or
  July\or August\or September\or October\or November\or December\fi
  \space\number\day, \number\year}
\baselineskip 14pt
\parskip 6pt
\rm
\line{\sevenrm a04.tex[162,phy] \today\hfill}

\bigskip
\noindent [RWF: replace the Algorithmic Entropy section of Decision Trees and
Entropy with this.]

\bigskip
\line{\bf Stoss' Entropy Theorem.\hfill}

Let $\{x↓{ij}\}$ be any finite array of positive real numbers. Define the marginal sums
$x↓i=\sum↓jx↓{ij}$, $x↓j=\sum↓ix↓{ij}$, $x=\sum↓{ij}x↓{ij}$. Then
$$\eqalign{&\qquad -\sum↓{ij}x↓{ij}\ln x↓{ij}+\sum↓{ij}x↓{ij}\ln x↓i
+\sum↓{ij}x↓{ij}\ln x↓j-\sum↓{ij}x↓{ij}\ln x\cr
\noalign{\smallskip}
&=\sum↓{ij}x↓{ij}\ln\left({x↓i x↓j\over x↓{ij}x}\right)\cr
\noalign{\smallskip}
&\qquad\hbox{(by the logarithmic inequality, $\ln u≤u-1$)}\cr
\noalign{\smallskip}
&≤\sum↓{ij}x↓{ij}\left({x↓ix↓j\over x↓{ij}x}-1\right)\cr
\noalign{\smallskip}
&=\sum↓{ij}{x↓ix↓j\over x}-\sum↓{ij}x↓{ij}\cr
\noalign{\smallskip}
\noalign{\hbox{(since $\sum↓{ij}x↓ix↓j=\bigl(\sum↓ix↓i\bigr)\bigl(\sum↓jx↓j\bigr)=x↑2$ and $\sum↓{ij}x↓{ij}=x$)}}
\noalign{\smallskip}
&=0\,,\cr}$$
with equality only if each $x↓{ij}={x↓ix↓j\over x}$; since changing the base
of the logarithm merely scales both sides, the relation holds in any base.
Define $\phi(u)=u\log u$. Then we have the theorem:
$$\sum↓{ij}\phi(x↓{ij})+\phi(x)≥\sum↓i\phi(x↓i)+\sum↓j\phi(x↓j)$$
with equality only when each $x↓{ij}={x↓ix↓j\over x}$.
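
As a numerical sanity check, here is a short Python sketch (the helper names
are ours, chosen for illustration); it draws a random positive array and
verifies the inequality:

import math
import random

def phi(u):
    return u * math.log(u)

def stoss_gap(x):
    """Left side minus right side of Stoss' theorem; nonnegative for positive x[i][j]."""
    xi = [sum(row) for row in x]            # row sums x_i
    xj = [sum(col) for col in zip(*x)]      # column sums x_j
    total = sum(xi)                         # grand sum x
    lhs = sum(phi(v) for row in x for v in row) + phi(total)
    rhs = sum(phi(v) for v in xi) + sum(phi(v) for v in xj)
    return lhs - rhs

random.seed(1)
x = [[random.uniform(0.1, 5.0) for j in range(4)] for i in range(3)]
assert stoss_gap(x) >= 0    # equality only when x[i][j] = x_i * x_j / x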

Suppose we have two queries $A$ and $B$ to make on a data set, and $x↓{ij}$
is the probability that $i$~is the answer to query~$A$, $j$~to query~$B$.
Then the entropy of query~$A$ is
$E↓A=-\sum↓i\phi(x↓i)$, that of query~$B$ is $E↓B=-\sum↓j\phi(x↓j)$, and the
entropy of answering both queries is $E↓{AB}=-\sum↓{ij}\phi(x↓{ij})$. Here
$x=1$, so $\phi(x)=0$, and Stoss' theorem states, plausibly, that
$E↓{AB}≤E↓A+E↓B$, with equality only when the queries are
independent. We can, however, get more out of the theorem than this unexciting
result.
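
For a concrete instance, a minimal sketch with an assumed $2\times2$ joint
table (the probabilities are invented for illustration):

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

x = [[0.4, 0.1],                                # joint probabilities x_ij;
     [0.2, 0.3]]                                # they sum to 1, so x = 1
E_A  = entropy([sum(row) for row in x])         # marginal entropy of query A
E_B  = entropy([sum(col) for col in zip(*x)])   # marginal entropy of query B
E_AB = entropy([v for row in x for v in row])   # joint entropy
assert E_AB <= E_A + E_B                        # here 1.846 <= 1.971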

We make a test, $A$, on a data set, as part of the process of determining
the answer to a query,~$B$. Let $x↓{ij}$ be the probability that the
outcome~$a$ of the test is~$i$ and the result~$b$ of the query is~$j$.
Then $x↓i$ is the probability that $a=i$, 
$x↓j$~is the probability that $b=j$, and $x=1$. The information
of the test is $I=-\sum↓i\phi(x↓i)$. The entropy of the query is
$E=-\sum↓j\phi(x↓j)$. If $a=i$, the conditional probability of $b=j$ is
$x↓{ij}/x↓i$, with conditional entropy $-\sum↓j\phi(x↓{ij}/x↓i)$; the 
expected entropy after the test is then $E'=-\sum↓i\left(x↓i\sum↓j\phi(x↓{ij}/
x↓i)\right)=-\sum↓{ij}\phi(x↓{ij})+\sum↓i\phi(x↓i)$. We can bound
$E'$ on both sides. By Stoss' theorem,
$E'≤-\sum↓j\phi(x↓j)=E$, so no test increases the expected entropy. A~test
fails to decrease the expected entropy only if the conditional probability of
every result, $x↓{ij}/x↓i$, is the same as the original probability~$x↓j$,
or $x↓{ij}=x↓ix↓j$.

On the other side,
$$\eqalign{\sum↓{ij}\phi(x↓{ij})&=\sum↓j\biggl(\sum↓ix↓{ij}\log x↓{ij}\biggr)\cr
\noalign{\smallskip}
&=\sum↓jx↓j\biggl(\sum↓i{x↓{ij}\over x↓j}\log x↓{ij}\biggr)\cr
\noalign{\smallskip}
&=\sum↓jx↓j\biggl(\sum↓i{x↓{ij}\over x↓j}\log\left({x↓{ij}\over x↓j}\right)
+\sum↓i{x↓{ij}\over x↓j}\log x↓j\biggr)\cr
\noalign{\smallskip}
\noalign{\hbox{(noticing that the first inner sum is $≤0$, and 
the second is $\log x↓j$)}}
\noalign{\smallskip}
&≤\sum↓j\phi(x↓j)\cr}$$
so $E'≥E-I$, with equality only if each $x↓{ij}/x↓j$ is zero or one. That is,
the conditional probability that $a=i$, given $b=j$, must be zero or one;
the result of the query must determine the outcome of the test if the full
information of the test is to be useful.
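
Both bounds, $E-I≤E'≤E$, are easy to check numerically; a minimal Python
sketch, again with an invented joint table:

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# x[i][j] = Pr(test outcome a = i and query result b = j); entries sum to 1
x = [[0.25, 0.05],
     [0.10, 0.60]]
xi = [sum(row) for row in x]                    # Pr(a = i)
I  = entropy(xi)                                # information of the test
E  = entropy([sum(col) for col in zip(*x)])     # entropy of the query
# expected entropy after the test: sum over i of x_i times entropy of b given a = i
E_post = sum(p * entropy([v / p for v in row]) for p, row in zip(xi, x))
assert E - I <= E_post <= E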

The expected reduction in entropy by the test is $E-E'=\sum↓{ij}x↓{ij}\log
\left({x↓{ij}\over x↓ix↓j}\right)$; where this can be evaluated, it defines
a greedy algorithm for choosing tests, assuming all tests take the same
amount of time. If they take different times, then ${E-E'\over t}$ 
(entropy reduction per unit time) can be used as a measure of efficiency.
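
Such a chooser might be sketched as follows (illustrative only: the candidate
joint tables and running times are invented, and the routine gain computes
$E-E'$ in bits):

import math

def plogp_sum(ps):
    return sum(p * math.log2(p) for p in ps if p > 0)

def gain(x):
    """Expected entropy reduction E - E' = sum_ij x_ij log(x_ij / (x_i x_j))."""
    xi = [sum(row) for row in x]
    xj = [sum(col) for col in zip(*x)]
    return plogp_sum([v for row in x for v in row]) - plogp_sum(xi) - plogp_sum(xj)

# candidate tests: name -> (joint table with the query, running time t)
tests = {
    "t1": ([[0.25, 0.25], [0.25, 0.25]], 1.0),  # independent of the query: gain 0
    "t2": ([[0.50, 0.00], [0.00, 0.50]], 2.0),  # fully informative, but slower
}
best = max(tests, key=lambda name: gain(tests[name][0]) / tests[name][1])
print(best)    # t2: 1 bit / 2 time units beats 0 bits / 1 time unit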

{\rmn
\narrower\smallskip\noindent
{\bf Example:} We are sorting data $(d↓1,d↓2,d↓3,d↓4)$, and have determined
that $d↓1>d↓2$. The result of the query (the complete ordering) determines the
outcome of every test $d↓i$:$d↓j$, so the full information of any test is used.
The test $d↓3$:$d↓4$ partitions the 12~remaining outcomes into 6 and~6, so its
information is 1.00~bits; the test $d↓1$:$d↓3$ partitions them into 8 and~4,
so its information is 0.918~bits. If
using $d↓1$ again saves a little time (e.g., avoids a register fetch), the
best test is probably $d↓1$:$d↓3$.
\smallskip}
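
The arithmetic above can be confirmed by brute-force enumeration; a short
Python sketch (our own bookkeeping, coding each ordering as a tuple of ranks):

import math
from itertools import permutations

def info(sizes):
    n = sum(sizes)
    return -sum(k / n * math.log2(k / n) for k in sizes if k)

# the 12 orderings of (d1, d2, d3, d4) consistent with d1 > d2
orders = [r for r in permutations(range(4)) if r[0] > r[1]]

def split(i, j):
    """Class sizes when the orderings are partitioned by the outcome of d_i : d_j."""
    yes = sum(1 for r in orders if r[i] > r[j])
    return [yes, len(orders) - yes]

print(split(2, 3), info(split(2, 3)))   # [6, 6]  1.000  for d3 : d4
print(split(0, 2), info(split(0, 2)))   # [8, 4]  0.918  for d1 : d3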

\bigskip
\parindent0pt
\copyright 1984 Robert W. Floyd

First draft April 1, 1985 

\bye